Recognition and binding of specific sites on DNA by proteins is central for many cellular functions 
such as transcription, replication, and recombination. In the process of recognition, a protein 
rapidly searches for its specific site on a long DNA molecule and then strongly binds this site. 
Here we aim to find a mechanism that can provide both a fast search (1-10 sec) and high stability 
of the specific protein-DNA complex ($K_d = 10^{-15} - 10^{-8}$ M). 

Earlier studies have suggested that rapid search involves the sliding of a protein along the DNA. 
Here we consider sliding as a one-dimensional (1D) diffusion in a sequence-dependent rough energy 
landscape. We demonstrate that, in spite of the landscape's roughness, rapid search can be 
achieved if 1D sliding is accompanied by 3D diffusion. We estimate the range of the specific and 
non-specific DNA-binding energy required for rapid search and suggest experiments that can test 
our mechanism. We show that optimal search requires a protein to spend half of its time sliding along 
the DNA and half diffusing in 3D. We also establish that, paradoxically, realistic energy functions 
cannot provide both rapid search and strong binding of a rigid protein. To reconcile these two 
fundamental requirements we propose a search-and-fold mechanism that involves the coupling of 
protein binding and partial protein folding. 

The proposed mechanism has several important biological implications for search in the presence of 
other proteins and nucleosomes, simultaneous search by several proteins, etc. The proposed mechanism 
also provides a new framework for the interpretation of experimental and structural data on 
protein-DNA interactions. 



I. INTRODUCTION 



The complex transcription machinery of cells is primarily regulated by a set of proteins, transcription factors (TFs), 
that bind DNA at specific sites. Every TF can have from one to several dozen such specific sites on the DNA. 
Upon binding to the site, a TF forms a stable protein-DNA complex that can either activate or repress transcription 
of nearby genes, depending on the actual control mechanism. Fast and reliable regulation of gene expression requires 
(1) fast (~1-10 sec) search and recognition of the specific site (referred to as the target or cognate site below) out 
of $10^6 - 10^9$ possible sites on the DNA, and (2) stability of the protein-DNA complex ($K_d = 10^{-15} - 10^{-8}$ M). In 
spite of its apparent simplicity, such a mechanism is not understood in depth, either qualitatively or quantitatively. 
Here we focus on the simpler case of bacterial TFs recognizing their cognate (target) sites on naked DNA. Needless 
to say, eukaryotic protein-DNA recognition is significantly complicated by chromatin packing of the DNA and the 
multi-subunit structure of TFs. Interestingly, similar problems of specific binding and binding rate arise in the 
context of oligonucleotide-DNA binding (Lomakin and Frank-Kamenetskii). 



Vast amounts of experimental data available these days provide the structures of protein-DNA complexes at 
atomic resolution in crystals and in solution (Bell and Lewis; Schumacher et al.; Lewis et al.; Luscombe et al.), 
binding constants for dozens of native and hundreds of mutated proteins (Grillo et al., 1998), calorimetry 
measurements (Spolar and Record, 1994), and novel single-molecule experiments. These experimental data have 
contributed most significantly to our present understanding of protein-DNA interaction since the early work of 
von Hippel, Berg et al. In a series of pioneering articles (Berg and von Hippel, 1987; Berg et al., 1981; 
von Hippel and Berg, 1988; Winter et al., 1989), they created a conceptual basis for describing both the kinetics 
and thermodynamics of protein-DNA interaction, which became a starting point for practically every subsequent 
theoretical work on the subject. 

We start by reviewing the history of the problem and describing the paradox of the "faster than diffusion" 
association rate. Next, we present the classical model of protein-DNA "sliding" and explain how this model can resolve 
the paradox. We outline the problem that the sliding mechanism faces if the energetics of protein-DNA interactions 
are taken into account. Next we introduce our novel quantitative formalism and undertake an in-depth exploration of 
possible mechanisms of protein-DNA interaction. 



A. "Faster than diffusion" search 



The problem of how a protein finds its target site on DNA has a long history. In 1970, Riggs et al. (Riggs et al., 
1970a,b) measured the association rate of the LacI repressor and its operator on DNA as $\sim 10^{10}$ M$^{-1}$s$^{-1}$. This 
astonishingly high rate (as compared to other biological binding rates) was shown to be much higher than the maximal 
rate achievable by 3D diffusion. In fact, if a protein binds its site by 3D diffusion, it has to hit the right site on the 
DNA within b = 0.34 nm. (A shift by 0.34 nm would result in binding a site that is shifted from the native one by 
one base pair, e.g. ... vs CGCAATTC.) Using the Debye-Smoluchowski equation (Bruinsma, 2002; Flyvbjerg et al., 2002; 
Richter and Eigen, 1974), with a protein diffusion coefficient of $D_{3d} \sim 10^{-7}$ cm$^2$/s (Elowitz et al., 1999), we get 

$$k_{DS} = 4\pi D_{3d} b \approx 10^{8}\ \mathrm{M^{-1}s^{-1}}. \qquad (1)$$



This value for the association rate, relevant for in vitro measurements, corresponds to target location in vivo on a 
time scale of a few seconds, when each cell contains up to several tens of TF molecules. 
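
The order of magnitude of the diffusion limit can be checked with a few lines of arithmetic. The sketch below is illustrative only: the diffusion coefficient of $10^{-7}$ cm$^2$/s is an assumed round value, and the helper name `smoluchowski_rate` is hypothetical. It gives a few times $10^{7}$ M$^{-1}$s$^{-1}$, within an order of magnitude of the diffusion limit quoted above and far below the measured $10^{10}$ M$^{-1}$s$^{-1}$.

```python
import math

AVOGADRO = 6.022e23  # molecules per mole


def smoluchowski_rate(d_cm2_per_s: float, b_cm: float) -> float:
    """Diffusion-limited association rate k = 4*pi*D*b in M^-1 s^-1."""
    k_cm3_per_s = 4.0 * math.pi * d_cm2_per_s * b_cm  # per molecule, cm^3/s
    return k_cm3_per_s * AVOGADRO / 1000.0            # 1 litre = 1000 cm^3


D_3D = 1e-7   # cm^2/s, assumed protein diffusion coefficient
B = 0.34e-7   # cm, target size of one base pair (0.34 nm)

k_ds = smoluchowski_rate(D_3D, B)
print(f"k_DS ~ {k_ds:.1e} M^-1 s^-1")
```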

To resolve the discrepancy between the experimentally measured rate of $10^{10}$ M$^{-1}$s$^{-1}$ and the maximal rate of 
$10^{8}$ M$^{-1}$s$^{-1}$ allowed by diffusion, Riggs et al., Richter et al. (Richter and Eigen, 1974) and later Winter, Berg and 
von Hippel (Berg et al.; von Hippel and Berg, 1989) suggested that the dimensionality of the problem changes 
during the search process. They concluded that while searching for its target site, the protein periodically scans the 
DNA by "sliding" along it. 



B. Sliding along the DNA 

If a protein performs both 3D and 1D diffusion, then the total search process can be considered as a 3D search 
followed by binding to DNA and a round of 1D diffusion. Upon dissociation from the DNA, the protein continues 
3D diffusion until it binds DNA in a different place, and so on. Several lines of experimental evidence support this 
search mechanism. They include the affinity of DNA-binding proteins for any fragment of DNA (non-specific binding), 
single-molecule experiments where 1D diffusion has been observed and visualized, and numerous other experiments 
where the rate of specific binding to the target site has been significantly increased by lengthening the non-specific 
DNA surrounding the site (Kim et al., 1987). What are the benefits and the mechanism of 1D diffusion, and what limits 
the search rate? 

Here we address this question and consider possible search mechanisms that involve both 1D and 3D diffusion, 
where 1D diffusion along the DNA proceeds along a rough energy landscape. Quantitative analysis of the search 
process brought us to the following four main results: 

• When the roughness of the binding energy landscape is greater than $\sim 2k_BT$, the diffusion along the DNA 
becomes extremely slow, with the protein unable to diffuse more than a few base-pairs. The total search process 
is prohibitively slow. 

• If the search proceeds by a combination of 1D and 3D diffusion, non-specific binding to the DNA plays a very 
important role in controlling the balance between these two processes. The optimal energy of non-specific 
binding can provide the maximal search rate. Although faster than either 3D or 1D search alone, an optimal 
combination of 3D and 1D diffusion cannot expedite the search if the roughness of the landscape is greater 
than $\sim 2k_BT$. 

• Experimentally observed and biologically relevant rates of search can be reached only when 1D sliding proceeds 
through a fairly smooth landscape with a roughness of the order of $k_BT$. 

• Paradoxically, the stability of the protein-DNA complex at the target site requires a roughness of the binding 
energy landscape considerably larger than $k_BT$. Rapid search by 1D/3D diffusion, however, is impossible at 
such roughness. 

Finally, we formulate this "search speed-stability" paradox and suggest a search-and-fold mechanism that can 
resolve it. The paradox can be resolved if the DNA-binding protein has two distinct (conformational) states in 
which it exhibits two modes of binding. In the first mode, which has weaker binding and a smoother landscape, it 
searches for its site. In the second, recognition mode, which has a larger roughness of the binding landscape, the 
protein tightly binds DNA sites. The correlation between the energy landscapes in the two modes, together with the 
energy difference and the barrier between the two protein conformations, controls the frequency of transition between 
the two modes and provides effective pre-selection of low-energy sites. 

We suggest that these modes correspond to two distinct conformational states of the protein-DNA complex (a more 
open complex in the search mode, and a tighter one in the recognition mode). The transition between the two states 
can include partial folding of the protein, water extrusion, a change in the DNA conformation, etc. Focusing on the 
conformation of the protein, and without loss of generality, we consider a partially unfolded (disordered) conformation 
and the folded conformation bound to the cognate site as the two conformations required by our model. In fact, a 
protein in the partially unfolded conformation may have fewer and/or weaker interactions with DNA, allowing rapid 
sliding. The folded conformation, in turn, provides the stronger and more specific interactions required for tight binding. 

We also quantify the requirements of this two-mode mechanism to provide both rapid search and stability. Known 
DNA-binding proteins are structurally flexible and have been reported to exhibit two or more distinct 
binding modes. This two-state mechanism also agrees well with the results of calorimetric experiments. 

The proposed search-and-fold mechanism is not limited to protein-DNA interaction: it provides a general framework for 
protein-ligand binding and demonstrates the advantages of induced folding, a common theme in molecular recognition. 

II. THE MODEL 

A. Search time 

In our model, the search process consists of N rounds of 1D search (each taking a time $\tau_{1d,i}$, $i = 1..N$) separated 
by rounds of 3D diffusion ($\tau_{3d,i}$). The total search time $t_s$ is the sum of the times of individual search rounds: 

$$t_s = \sum_{i=1}^{N} \left(\tau_{1d,i} + \tau_{3d,i}\right). \qquad (2)$$

The total number of such rounds occurring before the target site is eventually found is very large, so it is natural to 
introduce probability distributions for the essentially random entities in the problem. The first obvious simplification 
that can be made without any loss of rigor is to replace $\tau_{3d,i}$ by its average $\bar\tau_{3d}$. Each round of 1D diffusion scans 
a region of n sites (where n is drawn from some distribution p(n)). The time $\tau_{1d}(n)$ it takes to scan n sites can be 
obtained from the exact form of the 1D diffusion law (see Appendix A). If, on average, $\bar n$ sites are scanned in each 
round, then the average number of such rounds required to find the site on DNA of length M is $N = M/\bar n$. Using 
average values, we get a total search time of 

$$t_s(\bar n, M) = \frac{M}{\bar n} \left[\tau_{1d}(\bar n) + \bar\tau_{3d}\right]. \qquad (3)$$

From Eq. (3) it is clear that, in general, $t_s(\bar n, M)$ is large for both very small and very large values of $\bar n$. In fact, 
if $\bar n$ is small, very few sites are scanned in each round of 1D search and a large number of such rounds (alternating 
with rounds of 3D diffusion) are required to find the site. On the contrary, if $\bar n$ is large, a lot of time is spent 
scanning a single stretch of DNA, making the search very redundant and inefficient. An optimal value $\bar n_{opt}$ should 
exist that provides little redundancy of 1D diffusion and a sufficiently small number of such rounds. For a given 
diffusion law $\tau_{1d}(n)$, the function $t_s(\bar n, M)$ can be minimized, producing $\bar n_{opt}$, the optimal length of DNA to be 
scanned between the association and the dissociation events$^1$. 
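
The trade-off described above can be illustrated numerically. The following is a minimal sketch, not the paper's calculation: it assumes a purely diffusive scanning law $\tau_{1d}(n) = c\,n^2$ with a hypothetical prefactor `c`, and hypothetical round values for the genome length and the 3D round time. It scans candidate values of n and picks the one with the smallest total search time.

```python
def search_time(n, genome_length, tau_3d, tau_1d):
    """Total search time (M / n) * (tau_1d(n) + tau_3d) for a scan length n."""
    return (genome_length / n) * (tau_1d(n) + tau_3d)


def quadratic_law(n, c=1e-8):
    """Assumed diffusive scanning law: time to scan n sites grows as n^2."""
    return c * n * n


M = 1e6        # genome length in bp (hypothetical round value)
TAU_3D = 1e-3  # mean duration of one 3D round in seconds (hypothetical)

# Candidate scan lengths: powers of two from 1 bp to 65536 bp.
candidates = [2 ** k for k in range(17)]
times = {n: search_time(n, M, TAU_3D, quadratic_law) for n in candidates}
n_best = min(times, key=times.get)

for n in (1, n_best, candidates[-1]):
    print(f"n = {n:6d} bp  ->  t_s = {times[n]:10.2f} s")
```

Both extremes are slow: very short scans need many 3D rounds, and very long scans waste time re-scanning the same stretch of DNA.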



B. Protein-DNA energetics 



While diffusing along DNA, a TF experiences the binding potential U(s) of every site s it encounters. The energy 
of protein-DNA interactions is usually divided into two parts, specific and non-specific (Berg and von Hippel, 1987; 
Gerland et al., 2002): the specific energy of the site starting at position i, 

$$U_i = U(s = s_i, \ldots, s_{i+l-1}), \qquad (4)$$

and the non-specific energy $E_{ns}$, where s describes a binding DNA sequence of length l. As its name suggests, the 
non-specific binding energy $E_{ns}$ arises from interactions that do not depend on the DNA sequence that the TF is 
bound to, e.g. interactions with the phosphate backbone. The specific part of the interaction energy exhibits a very 
strong dependence on the actual nucleotide sequence. Here and below we use the term "energy" to refer to the change 
in the free energy related to binding, $\Delta G_b$. This free energy includes the entropic loss of translational and 
rotational degrees of freedom of the protein and amino acid side-chains, the entropic cost of water and ion extrusion 
from the DNA interface, the hydrophobic effect, etc. 

$^1$ Naturally, we assume here that $\tau_{1d}(n)$ grows with n at least as $O(n^{1+\alpha})$, with $\alpha > 0$. 

The energy of specific protein-DNA interactions can be approximated by a weight matrix (also known as a "PSSM" 
or "profile") where each nucleotide contributes independently to the binding energy (Berg and von Hippel, 1987): 

$$U(s) = \sum_{j=1}^{l} \epsilon(j, s_j), \qquad (5)$$



where $s_j$ is the base-pair in position j of the site and $\epsilon(j, x)$ is the contribution of base-pair x in position j. Most 
of the known weight matrices of TFs $\epsilon(j, s_j)$ give rise to uncorrelated energies of overlapping neighboring sites, 
obtained by a one-base-pair shift (Gerland et al., 2002). Figure 1 presents distributions of the sequence-specific 
binding energy f(U) obtained for different bacterial transcription factors and all possible sites in the corresponding 
genome. The weight matrices for these transcription factors have been derived using a set of known binding sites and 
a standard approximation (Berg and von Hippel, 1987; Stormo and Fields, 1998). Notice that for a sufficiently long 
site the distribution of the binding energy of random sites (or genomic DNA) can be closely approximated (see Fig. 1) 
by a Gaussian distribution with a certain mean $\langle U \rangle$ and variance $\sigma^2$: 

$$f(U) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(U - \langle U \rangle)^2}{2\sigma^2}\right]. \qquad (6)$$



We also assume independence of the energies of neighboring (though overlapping) sites. Binding energies calculated 
for bacterial TFs support this assumption. Other physical factors such as local DNA flexibility (Erie et al., 1994) can 
create a correlated energy landscape providing a different mode of diffusion that we have described in (Slutsky et al., 
2004). 
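
The weight-matrix picture above lends itself to a short numerical illustration. The sketch below is not based on any real transcription factor: the weight matrix is drawn at random, the "genome" is a uniformly random sequence, and all function names are hypothetical. It builds the energy profile by summing independent per-position contributions and compares the observed mean and spread with the values expected when positions contribute independently.

```python
import random
import statistics

BASES = "ACGT"


def random_weight_matrix(length, rng):
    """Hypothetical weight matrix: eps[j][x] is the energy (in k_B T) of base x at position j."""
    return [{x: rng.gauss(0.0, 1.0) for x in BASES} for _ in range(length)]


def site_energy(eps, seq, start):
    """Sum of independent per-position contributions for the site starting at `start`."""
    return sum(eps[j][seq[start + j]] for j in range(len(eps)))


def energy_profile(eps, seq):
    """Specific energy of every overlapping site along the sequence."""
    return [site_energy(eps, seq, i) for i in range(len(seq) - len(eps) + 1)]


rng = random.Random(0)
eps = random_weight_matrix(16, rng)
genome = "".join(rng.choice(BASES) for _ in range(20000))

U = energy_profile(eps, genome)
mean_U = statistics.fmean(U)
sigma_U = statistics.pstdev(U)

# For a uniformly random sequence, per-position means and variances simply add up.
pred_mean = sum(statistics.fmean(list(col.values())) for col in eps)
pred_sigma = sum(statistics.pvariance(list(col.values())) for col in eps) ** 0.5

print(f"observed : mean = {mean_U:.2f}, sigma = {sigma_U:.2f}")
print(f"predicted: mean = {pred_mean:.2f}, sigma = {pred_sigma:.2f}")
```

Because each site energy is a sum of many independent terms, the histogram of `U` is close to a Gaussian with these two parameters, which is the approximation used in the text.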



C. Diffusion in a sequence-dependent energy landscape 

The whole DNA molecule can thus be mapped onto a one-dimensional array of sites $\{s_i\}$, each corresponding to a 
certain binding sequence comprising bases from the i-th to the (i + l - 1)-th, l being the length of the motif (see 
Fig. 2). At each site, there is a probability $p_i$ of hopping to site i + 1 and a probability $q_i$ of hopping to site i - 1. 
These probabilities depend on the specific binding energies $U_i$ and $U_{i\pm1}$ at the i-th site and at the adjacent sites 
respectively and are proportional to the corresponding transition rates, $\omega_{i \to i+1}$ and $\omega_{i \to i-1}$. For the latter, it is most 
natural to assume the regular activated transport form 

$$\omega_{i \to i\pm1} = \nu \times \begin{cases} e^{-\beta(U_{i\pm1} - U_i)} & \text{if } U_{i\pm1} > U_i, \\ 1.0 & \text{otherwise}, \end{cases} \qquad (7)$$

where $\nu$ is the effective attempt frequency, $\beta = (k_BT)^{-1}$, $k_B$ is the Boltzmann constant and T is the ambient 
temperature. Having defined that, we have a one-dimensional random walk with position-dependent hopping probabilities. 

As has been shown in numerous papers throughout the last two decades, the properties of 1D random walks can 
vary dramatically depending on the actual choice of probabilities $\{p_i\}$ (for a review, see e.g. (Bouchaud and Georges, 
1990)). Here we employ the mean first-passage time formalism (Murthy and Kehr, 1989) to derive the diffusion law 
$\tau_{1d}(n)$ for protein sliding along the DNA given the sequence-dependent binding energy. 
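
As an illustration of how hopping probabilities follow from the energy profile, here is a minimal sketch. It assumes an activated form in which uphill moves pay a Boltzmann factor and downhill moves occur at the bare attempt rate, and it assumes that $p_i$ and $q_i$ are obtained by normalizing the two rates; the function names are hypothetical.

```python
import math


def hop_rate(u_from, u_to, nu=1.0, beta=1.0):
    """Assumed activated rate: Boltzmann factor for uphill moves, bare rate nu otherwise."""
    barrier = u_to - u_from
    return nu * math.exp(-beta * barrier) if barrier > 0 else nu


def hop_probabilities(U, i, beta=1.0):
    """Probabilities (p_i, q_i) of hopping right / left from interior site i."""
    w_right = hop_rate(U[i], U[i + 1], beta=beta)
    w_left = hop_rate(U[i], U[i - 1], beta=beta)
    total = w_right + w_left
    return w_right / total, w_left / total


# A protein sitting in a local trap (energies in units of k_B T).
U = [0.0, -2.0, 1.0]
p, q = hop_probabilities(U, 1)
print(f"p = {p:.3f}, q = {q:.3f}")  # escaping to the higher neighbour is less likely

flat_p, flat_q = hop_probabilities([0.0, 0.0, 0.0], 1)
```

On a flat landscape the walk is unbiased; any roughness makes the hopping probabilities position dependent.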

FIG. 1 Spectrum of binding energy for three different transcription factors and the Gaussian approximation (solid line). 

III. RESULTS 

Using the model described above, we studied the following problems: 

• How fast is the 1D search on DNA as a function of the "roughness" $\sigma$ of the binding energy landscape? 

• How significant is the role of the non-specific binding energy $E_{ns}$ in determining the search time? 

• How fast is the search for the native site under conditions that provide stability to the protein-DNA complex 
at the target site? 



Diffusion along the DNA 

We state here the main results without a derivation (which can be found in Appendix A). For a given set 
of probabilities $\{p_i\}$, the mean first-passage time (MFPT) from i = 0 to i = L (in terms of number of steps) is 
(Murthy and Kehr, 1989) 

$$\bar t_{0,L} = \sum_{k=0}^{L-1} \frac{1}{p_k} + \sum_{k=0}^{L-2} \frac{1}{p_k} \sum_{i=k+1}^{L-1} \prod_{j=k+1}^{i} \alpha_j, \qquad (8)$$

where $\alpha_i = q_i/p_i$. The relation (8) gives the MFPT for one given realization of probabilities. Assuming that the 
specific binding energies $\{U_i\}$ have a normal distribution with variance $\sigma^2$ (see above), we plug the probabilities in 
(7) into (8) and after a somewhat lengthy but straightforward calculation, we obtain an expression for the MFPT 
averaged over genomic sequences for $L \gg 1$: 

$$\langle \bar t_{0,L} \rangle \simeq L^2 \tau_0 \, e^{7\beta^2\sigma^2/4} \left(1 + \frac{\beta^2\sigma^2}{2}\right)^{-1/2}, \qquad (9)$$

where $\tau_0$ is the reciprocal of the effective attempt frequency for hopping to a neighboring site. 

FIG. 2 The Model Potential. 

The main result is that the 1D search by hopping to neighboring sites proceeds by normal diffusion with 
$\langle \bar t \rangle \simeq L^2/2D_{1d}$, where the diffusion coefficient 

$$D_{1d} \simeq \frac{1}{2\tau_0} \left(1 + \frac{\beta^2\sigma^2}{2}\right)^{1/2} e^{-7\beta^2\sigma^2/4} \qquad (10)$$

exhibits an exponential dependence on the "roughness" $\sigma$ of the binding energy landscape, dropping rapidly as $\sigma$ 
becomes greater than a few $k_BT$ (Slutsky et al., 2004). Hence, rapid diffusion of a protein along the DNA is possible 
only if the roughness of the binding energy landscape is small compared to $k_BT$ ($\beta\sigma < 1.5$). This requirement 
imposes strong constraints on the allowed energy of specific binding interactions. 
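
The slowdown caused by roughness can be explored numerically with the mean first-passage time formula quoted above. The sketch below makes several assumptions that are not taken from the text: it evaluates the double sum through the algebraically equivalent recursion $T_k = 1/p_k + \alpha_k T_{k-1}$, treats site 0 as reflecting, uses the activated hopping form with normalized probabilities, and counts hops only (waiting times are ignored). It therefore illustrates the trend with $\sigma$ rather than reproducing the prefactors of the averaged result.

```python
import math
import random


def mfpt_steps(p):
    """Mean number of hops from site 0 to site L = len(p), given hop-right probabilities p[k]."""
    total = 0.0
    t_k = 0.0  # mean hops needed to advance from site k to site k+1
    for pk in p:
        t_k = 1.0 / pk + ((1.0 - pk) / pk) * t_k
        total += t_k
    return total


def hop_right_probabilities(U, beta=1.0):
    """Hop-right probabilities for sites 0..L-1 of a landscape U[0..L] (site 0 reflecting)."""
    p = [1.0]
    for i in range(1, len(U) - 1):
        w_right = min(1.0, math.exp(-beta * (U[i + 1] - U[i])))
        w_left = min(1.0, math.exp(-beta * (U[i - 1] - U[i])))
        p.append(w_right / (w_right + w_left))
    return p


def mean_mfpt(sigma, length=200, samples=50, seed=0):
    """MFPT in hops, averaged over random Gaussian landscapes of roughness sigma (in k_B T)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(samples):
        U = [rng.gauss(0.0, sigma) for _ in range(length + 1)]
        acc += mfpt_steps(hop_right_probabilities(U))
    return acc / samples


flat = mfpt_steps(hop_right_probabilities([0.0] * 51))  # unbiased walk: L^2 hops
results = {sigma: mean_mfpt(sigma) for sigma in (0.0, 1.0, 2.0)}
for sigma, hops in results.items():
    print(f"sigma = {sigma:.0f} k_B T: mean MFPT = {hops:.3g} hops")
```

For a flat landscape the recursion returns exactly $L^2$ hops, the familiar diffusive result, and the hop count grows quickly once $\sigma$ exceeds about $k_BT$.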



A. Optimal time of 3D/1D search 

When 1D scanning is combined with 3D diffusion, what is the optimal time a protein has to spend in each of 
the two regimes? To answer this question we compute the optimal number of sites the protein has to scan by 1D 
diffusion in order to get the fastest overall search. The results of this section are rather general and are not limited 
to the particular scenario of slow 1D diffusion on a rough landscape discussed above. 

Each time the protein binds DNA it performs a round of 1D diffusion. If the round lasts $\tau_{1d}$, then on average the 
protein scans $\bar n = \sqrt{16 D_{1d} \tau_{1d}/\pi}$ bps (Hughes, 1995). By plugging this relation into Eq. (3) for the search time $t_s$, and 
minimizing $t_s$ with respect to $\bar n$, we get the optimal total search time and the optimal number of sites to be scanned 
in each round: 

$$t_s^{opt} = t_s(\bar n_{opt}) = \frac{M}{2} \sqrt{\frac{\pi \bar\tau_{3d}}{D_{1d}}}, \qquad \bar n_{opt} = \sqrt{\frac{16}{\pi} D_{1d} \bar\tau_{3d}}. \qquad (11)$$

This analysis brings us to the following conclusions. 

First, and most importantly, we obtain that in the optimal regime of search 

$$\tau_{1d}(\bar n_{opt}) = \bar\tau_{3d}, \qquad (12)$$

i.e. the protein spends equal amounts of time diffusing along non-specific DNA and diffusing in the solution. This 
striking result is very general, and is true irrespective of the values of the diffusion coefficients $D_{1d}$ or $D_{3d}$, or the 
size of the genome M. In fact it follows directly from the diffusion law $\bar n \sim \sqrt{\tau_{1d}}$. More importantly, this central 
result can be verified experimentally by either single-molecule techniques or by traditional methods. 

Also note that the optimal region of the DNA scanned in a single round of 1D diffusion, $\bar n_{opt}$, does not depend on 
M, i.e. it is the same irrespective of the size of the genome to be searched for a specific site. 

Second, the optimal 1D/3D combination reached at $\tau_{1d} = \bar\tau_{3d}$ leads to a significant speed-up of the search process. 
In fact, an optimal 1D/3D search is $\bar n_{opt}$ times faster than a search by 3D diffusion alone, and $M/\bar n_{opt}$ times faster 
than a search by 1D diffusion alone. For example, if the protein operates in the optimal 1D/3D regime and scans 
$\bar n_{opt} \approx 100$ bp during each round of DNA binding, then the experimentally measured rate of binding to the specific 
site can be 100 times greater than the rate achievable by 3D diffusion alone. 

Third, we can estimate the maximal number of sites a protein can scan in each round of 1D search. If we set 
$D_{1d}$ to its maximum, i.e. $D_{1d} \approx D_{3d}$ and $\bar\tau_{3d} \approx l^2/D_{3d}$, with $l \approx 0.1\,\mu$m, we get 

$$\bar n_{opt} \approx 500\ \mathrm{bp}. \qquad (13)$$

For a smaller 1D diffusion coefficient, e.g. $D_{1d} \approx D_{3d}/100$, we get $\bar n_{opt} \approx 50$ bp. Again, single-molecule experiments 
can provide estimates of these quantities for different conditions of diffusion. 

Finally, we obtain estimates of the shortest possible total search time. If $M \approx 10^6$ bp and 1D diffusion is at its 
fastest rate, i.e. $D_{1d} \approx D_{3d} \approx 10^{-7}$ cm$^2$/s, then using Eq. (11) we get 

$$t_s^{opt} \approx \frac{M}{2} \sqrt{2\pi \bar\tau_{3d} \tau_0} \approx 5\ \mathrm{sec}, \qquad (14)$$

where we estimate $\tau_0 \approx a^2/D_{3d} \approx 10^{-8}$ sec. 
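
These estimates can be reproduced with a short calculation. The sketch below is a rough check under stated assumptions: the fastest 1D diffusion coefficient is taken as $1/(2\tau_0)$ in units of bp$^2$/s, and $\tau_0 = 10^{-8}$ s, $\bar\tau_{3d} = 10^{-3}$ s and $M = 10^6$ bp are round values; the function name is hypothetical.

```python
import math


def optimal_search(genome_length, d_1d, tau_3d):
    """Optimal scan length (bp) and total search time (s) for tau_1d(n) = pi n^2 / (16 D)."""
    n_opt = math.sqrt(16.0 * d_1d * tau_3d / math.pi)
    t_opt = (genome_length / 2.0) * math.sqrt(math.pi * tau_3d / d_1d)
    return n_opt, t_opt


TAU_0 = 1e-8                 # s, assumed hopping time per base pair
TAU_3D = 1e-3                # s, assumed duration of one 3D round
M = 1e6                      # bp, assumed genome length
D_MAX = 1.0 / (2.0 * TAU_0)  # bp^2/s, assumed smooth-landscape limit

n_fast, t_fast = optimal_search(M, D_MAX, TAU_3D)
n_slow, t_slow = optimal_search(M, D_MAX / 100.0, TAU_3D)

print(f"fast sliding: n_opt ~ {n_fast:.0f} bp, t_opt ~ {t_fast:.1f} s")
print(f"100x slower : n_opt ~ {n_slow:.0f} bp, t_opt ~ {t_slow:.1f} s")

# At the optimum the time spent sliding in one round equals the 3D round time.
tau_1d_at_opt = math.pi * n_fast ** 2 / (16.0 * D_MAX)
```

A 100-fold drop in the 1D diffusion coefficient shortens the optimal scan length and lengthens the search time by only a factor of 10, because both scale with the square root of the diffusion coefficient.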

One can also estimate the search time using in vitro experimentally measured binding rates in water, 
$k_{on}^{water} \sim 10^{10}$ M$^{-1}$s$^{-1}$ (Riggs et al., 1970a,b). The diffusion coefficient of a protein in the cytoplasm is 10 - 100 
times lower than that in water, leading to an estimated binding rate of $k_{on}^{cytoplasm} \approx 10^{8} - 10^{9}$ M$^{-1}$s$^{-1}$ (see 
Appendix D). From this we obtain the time it takes for one protein to bind one site in a cell of 1 $\mu$m$^3$ volume 
(i.e. [TF] $\approx 10^{-9}$ M) as 

$$t_s = \left(k_{on}^{cytoplasm} [TF]\right)^{-1} \approx 1 - 10\ \mathrm{sec}. \qquad (15)$$

One can see good agreement between our theoretical estimates and experimentally measured binding rates. 

As we mentioned above, there are usually several TF molecules searching in parallel for the target site. Naturally, 
in this case, the search is sped up proportionally to the number of molecules. 
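
The conversion from a single molecule per cell to a molar concentration, and from there to a binding time, is simple arithmetic. The sketch below assumes a cell volume of exactly 1 $\mu$m$^3$ and the two round binding rates $10^8$ and $10^9$ M$^{-1}$s$^{-1}$; the helper name is hypothetical.

```python
AVOGADRO = 6.022e23  # molecules per mole


def molar_concentration(copies, volume_um3):
    """Concentration in mol/L of `copies` molecules in a volume given in cubic micrometres."""
    litres = volume_um3 * 1e-15  # 1 um^3 = 1e-15 L
    return copies / (AVOGADRO * litres)


tf_conc = molar_concentration(1, 1.0)  # one TF molecule in a 1 um^3 cell
search_times = {k_on: 1.0 / (k_on * tf_conc) for k_on in (1e8, 1e9)}

print(f"[TF] ~ {tf_conc:.1e} M")
for k_on, t in search_times.items():
    print(f"k_on = {k_on:.0e} M^-1 s^-1  ->  t ~ {t:.1f} s")

# With N molecules searching in parallel the time drops N-fold.
t_parallel = search_times[1e8] / 10
```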

B. Diffusion of PurR on E. coli genome 

To check the applicability of the above considerations, we simulated one-dimensional diffusion of the PurR 
transcription factor on the E. coli chromosome. 

The specific energy profile was built using a weight matrix derived from 35 PurR binding sites following a standard 
procedure described elsewhere (Berg and von Hippel, 1987; Stormo and Fields, 1998). The resulting energy profile 
is random and uncorrelated and has a standard deviation $\sigma \approx 6.5\,k_BT$. This profile was used as an input for 
calculating the mean first-passage time at different temperatures$^2$. The result of these calculations is presented in Fig. 3. 
It is clear that when the roughness of the landscape becomes significant, $\sigma > 2k_BT$, the diffusion proceeds extremely 
slowly. Only $\sim 10 - 100$ bp can be scanned by a TF when $\sigma \sim 2k_BT$. A natural requirement for sufficiently fast 
diffusion is, as before, $\sigma \lesssim k_BT$. 

FIG. 3 The mean first-passage time vs traveling distance for the PurR transcription factor on binding landscapes of different 
roughness (or at different temperatures). The horizontal line indicates the optimal regime, $\tau_{1d} \approx \bar\tau_{3d}$. 



C. Non-specific binding 

While the diffusion of the TF molecules along DNA is controlled by the specific binding energy, the dissociation 
of the TF from the DNA depends on the total binding energy, i.e. on the non-specific binding as well as on the 
specific one. Moreover, since the dissociation events are much less frequent than the hopping between neighboring 
base-pairs (roughly by a factor of $\bar\tau_{3d}/\langle\tau\rangle$), the non-specific energy $E_{ns}$ makes a sensibly larger contribution to the 
total binding energy. 

$^2$ Since the magnitude of the interaction is fixed, in these calculations we vary temperature rather than binding strength. 

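For a TF at rest bound to some DNA site i, the dissociation rate would be given by the Arrhenius-type 
relation, 

$$r_i = \frac{1}{\tau_0} e^{-\beta(E_{ns} - U_i)}. \qquad (16)$$

Given the specific $U_i$ and non-specific $E_{ns}$ energies, one can calculate the average time $\bar\tau_{1d}$ a protein spends before 
dissociating from the DNA (see Appendix B). We obtain 

$$E_{ns} = k_BT \left[ \ln\frac{\bar\tau_{1d}}{\tau_0} - \frac{1}{2}\left(\frac{\sigma}{k_BT}\right)^2 \right], \qquad (17)$$

and in the optimal regime where $\bar\tau_{1d} = \bar\tau_{3d}$ 

$$E_{ns} = k_BT \left[ \ln\frac{\bar\tau_{3d}}{\tau_0} - \frac{1}{2}\left(\frac{\sigma}{k_BT}\right)^2 \right]. \qquad (18)$$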



D. The parameter space 

Since for a given value of $\sigma$ the non-specific binding controls the dissociation rate, the search time will deviate 
from the optimum if $E_{ns}$ moves from this predetermined value. In Fig. 4a we plot the search time as a function of 
the non-specific binding energy for different values of $\sigma$. 

We now define the tolerance factor $\zeta$ as the ratio between the acceptable value of the search time $t_s$ and the 
optimal search time $t_s^{opt}$. Experimental data suggest $\zeta \sim 5$, but for the moment we allow for much larger values of 
$\zeta \sim 10 - 100$ (this can be done when, for instance, there are many protein molecules searching in parallel). As we 
can see from Fig. 4a, for each value of $\sigma$ there is a range of possible values of $E_{ns}$ such that the resulting search time 
is within the region of tolerance (see Appendix B). Notice the dramatic increase in the search time as $E_{ns}$ deviates 
from its optimal value. 

Specifying $\zeta$, we can define our parameter space, i.e. the values of specific and non-specific energy producing 
a total search time within the region of tolerance. In Fig. 4b we consider three values of $\zeta$. The most relaxed 
requirement, $\zeta = 100$, provides a search time $t_s < 500$ sec. If 100 proteins are searching for a single site, then the first 
one will find it after $\approx 5$ sec, leading however to a fairly low binding rate of 
$k_{on} \sim (1/500\ \mathrm{sec}) \cdot 10^{9}\ \mathrm{M^{-1}} = 2 \cdot 10^{6}$ M$^{-1}$s$^{-1}$ (compared to the experimentally measured $10^{10}$ M$^{-1}$s$^{-1}$ in water). 
Importantly, in order to comply with even this most relaxed search time requirement, the characteristic strength of 
the specific interaction must be smaller than $\approx 2.3\,k_BT$. 

These results bring us to a very important conclusion: a protein cannot find its site in a biologically relevant time 
if the roughness of the specific binding landscape is greater than $\approx 2\,k_BT$. Although an optimal 1D/3D combination 
can speed up the search, it cannot overcome the slowdown of 1D diffusion. Only fairly smooth landscapes ($\sigma \sim 1\,k_BT$) 
can be effectively navigated by proteins. 



IV. SPEED VERSUS STABILITY 

While rapid search requires fairly smooth landscapes ($\sigma \sim 1\,k_BT$), stability of the protein-DNA complex, in turn, 
requires a low energy of the target site ($U_{min} \lesssim -15\,k_BT$ for a genome of $10^6$ bp). 

FIG. 4 (a) Dependence of the search time on the non-specific binding energy. (b) The parameter space. The dashed line 
corresponds to optimal parameters $\sigma$ and $E_{ns}$ connected by Eq. (18). 

In Fig. 5a, we present the equilibrium probability $P_b$ of binding the strongest target site with energy $U_0$ 
(Gerland et al., 2002) as a function of $\sigma/k_BT$. In equilibrium, $P_b$ equals the fraction of time the protein spends at 
the target site: 

$$P_b = \frac{\exp[-\beta U_0]}{\sum_{i=0}^{M} \exp[-\beta U_i]}. \qquad (19)$$

Since the target site is not separated from the rest of the distribution by a significant energy gap, $P_b$ is comparable 
to 1 (which is the natural requirement for a good regulatory site) only at $\sigma$ much greater than $k_BT$. 

Figure 5b shows the optimal search time at the corresponding values of $\sigma/k_BT$. The high roughness $\sigma \gg k_BT$ 
required for stability of the protein-DNA complex leads to astronomically large search times. In contrast, a protein 
can effectively search for the target site at $\sigma$ smaller than $1 - 2\,k_BT$. 

This brings us to the central result that the ability to translocate rapidly along the DNA clearly cannot comply with 
the stability requirement. 

The requirement of high stability at the target site, $P_b \sim 1$ (or $P_b \sim 1/N_p$ if $N_p$ copies of the protein are present), 
yields an estimate for the minimal $\sigma$, 

$$\sigma \approx k_BT \sqrt{2 \ln M} \approx 5\,k_BT, \qquad (20)$$

given a genome size $M = 10^6$. 
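
The dependence of the target-site occupancy on roughness can be illustrated by direct sampling. The sketch below is an illustration under stated assumptions, not a reproduction of Fig. 5a: site energies are drawn as $U_i = \sigma z_i$ with $z_i$ standard normal, the "target" is simply the lowest-energy site of the sample, and a smaller genome of $10^5$ sites is used to keep the run short. The function name is hypothetical.

```python
import math
import random


def binding_probability(z, beta_sigma):
    """Equilibrium occupancy of the lowest-energy site for energies U_i = sigma * z_i."""
    z_min = min(z)
    # Shift by z_min so that the largest Boltzmann weight is exactly 1 (avoids overflow).
    partition = sum(math.exp(-beta_sigma * (zi - z_min)) for zi in z)
    return 1.0 / partition


rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

occupancy = {bs: binding_probability(z, bs) for bs in (1, 2, 4, 6, 8)}
for bs, pb in occupancy.items():
    print(f"sigma = {bs} k_B T: P_b = {pb:.3g}")

# Estimate of the minimal roughness for a genome of 1e6 sites.
sigma_min = math.sqrt(2.0 * math.log(1e6))
print(f"sqrt(2 ln M) = {sigma_min:.2f} k_B T for M = 1e6")
```

The occupancy is negligible for $\sigma \sim k_BT$ and becomes appreciable only for a roughness of several $k_BT$, consistent with the estimate $\sigma \approx k_BT\sqrt{2 \ln M}$.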

From the above analysis, an obvious conflict arises: the same energy landscape cannot allow for both rapid 
translocation and high stability of states formed at sites with the lowest energy. This conflict is similar to the 
speed-stability paradox of protein folding formulated by Gutin et al. (Gutin et al., 1998): rapid search in conformation 
space requires a smooth energy landscape, but then the native state is unstable. In protein folding, this conflict is 
resolved by the presence of a large energy gap between the native state and the rest of the conformations 
(Finkelstein and Ptitsyn, 2002; Pande et al.). 

As evident from Fig. 1, no such energy gap separates cognate sites from the bulk of other (random) sites. In fact, 
the energy function in the form of Eq. (5) cannot, in principle, provide a significant energy gap. Increasing the number 
of TFs cannot resolve the paradox either (see Appendices D and E). An alternative solution must be sought. 



V. THE TWO-MODE MODEL 



The "search speed - stability" paradox has already been qualitatively anticipated by Winter, Berg and von Hippel 
(Winter et al., 1989), who therefore concluded that a conformational change of some sort should exist that would 
allow fast switching between "specific" and "non-specific" modes of binding. In the non-specific mode, the protein is 
"sliding" over an essentially equipotential surface (in our terms, $U_{non-spec} \approx 0$) whereas site-binding takes place in 
the "specific" mode ($\sigma_{spec} \gg k_BT$). A protein in the non-specific binding mode is "unaware" of the DNA sequence 
it is bound to. Thus, it should permanently alternate between the binding modes, probing the underlying sites for 
specificity. 

This model naturally raises a question about the nature of the conformational change. Originally, it was described 
as a "microscopic" binding of the protein to the DNA accompanied by water and ion extrusion. However, numerous 
calorimetry measurements and calculations (Spolar and Record, 1994) show that such a transition is usually 
accompanied by a large heat capacity change $\Delta C$. This $\Delta C$ cannot be accounted for unless additional degrees of 
freedom, namely protein folding, are taken into account. On-site folding of the transcription factor may involve 
significant structural change (Bruinsma, 2002; Flyvbjerg et al., 2002; Kalodimos et al., 2004) and take a time of 
$10^{-4} - 10^{-6}$ sec (Akke, 2002) (compared to the much shorter characteristic on-site time $\tau_0$). We conclude that the 
conformational transition between the two modes involves (but is not limited to) partial folding of the TF. 

FIG. 5 (a) Stability of the protein-DNA complex on the cognate site, measured as the fraction of time in the bound state at 
equilibrium. (b) Optimal search time as a function of the binding profile roughness, for a range of the parameters 
$\bar\tau_{3d}$ and $\tau_0$. 

If the TF were to probe every site for specificity in this fashion, it would take hours to locate the native site. We 
note, however, that if there were a way to probe only a very limited set of sites, i.e. only those having a high 
potential for specificity, the search time would be dramatically reduced. From the previous section it is clear that a 
relatively weak site-specific interaction (i.e. a smooth landscape, $\sigma \sim k_BT$) does not significantly affect the diffusive 
properties of the protein on the DNA or the total search time. If this landscape, however, is correlated with the actual 
specific binding energy landscape (with $\sigma \sim 5 - 6\,k_BT$), the specific sites will be the strongest ones in both modes. The 
protein conformational changes should therefore occur mainly at these sites, which constitute "traps" in the smooth 
landscape. Since such sites constitute a very small fraction of the total number of sites, the transitions between the 
modes are very rare. 

We therefore suggest that there are two modes of protein-DNA binding: the search mode and the recognition
mode (Fig. 6). In the search mode, the protein conformation is such that it allows only a relatively weak site-specific
interaction ($\sigma_s \approx 1.0-2.0\ k_BT$) (Fig. 6, top). In the recognition mode, the protein is in its final conformation and
interacts very strongly ($\sigma_r > 5\ k_BT$) with the DNA (Fig. 6, bottom). If the two energy profiles are strongly correlated,
then the lowest-lying energy levels ("traps") in the search mode ($< -5\ k_BT$) are likely to correspond to the strongest
sites in the recognition mode, putatively the cognate sites. The transitions between the two modes happen mainly
when the protein is trapped at a low-energy site of the search landscape. In this fashion, the 1D diffusion coefficient
$D_{1d}$ is about 10-100 times smaller than the ideal limit, but the search time in the optimal regime is increased only by
a factor of 3-10, since the optimal search time scales as $D_{1d}^{-1/2}$.

The coupling between the conformational change and association at a site with a low-energy trap is likely to take
place through time conditioning. Namely, the folding (or a similar conformational transition) occurs only if the protein
spends some minimal amount of time bound to a certain site. This statement is basically equivalent to saying that
the free energy barrier that the protein must overcome to transform to the final state must be comparable to the
characteristic energy difference that controls hopping to the neighboring sites.

The protein conformation in the recognition mode should be stabilized by additional protein-DNA interactions. If these
interactions are unfavorable, the folded structure is destabilized, the search conformation is rapidly restored,
and the diffusion proceeds as before. If the new interactions are favorable, the folded structure is stable and the
protein is trapped at the site for a very long time.

For this mechanism to work, the transition between the two modes of search has to be associated with a significant
change in the free energy ($\sim 5..10\ k_BT$) of the protein-DNA complex (see Fig. 6(c)). Such an energy difference between
the two states is required to make most of the high-energy sites in the recognition mode less favorable than in the
search mode, so a protein would rather (partially) unfold than bind an unfavorable site. As a result, sites that lie
higher in energy than a certain cutoff exhibit similar non-specific binding energy (i.e. switch into the search mode of
binding). Folding of partially disordered protein loops or helices can provide the required free energy difference between
the two modes.

The efficiency of the proposed search-and-fold mechanism depends on the energy difference between the two modes, the
correlation between the energy profiles, and the barrier between the two states. The barrier determines the rate of the partial
folding-unfolding transition. If the barrier is too low, then the protein equilibrates while on a single site, having
no effect on search kinetics. On the contrary, too high a barrier can lead to rare folding events, and the cognate site
can be missed. It can be shown that a proper size of the barrier provides efficient search and a stable protein-DNA
complex. Alternatively, the cognate site can lower the barrier by stabilizing the transition state (i.e. the folding
nucleus (Abkevich et al.; Mirny and Shakhnovich, 2001)), acting as a catalyst of partial folding. Quantitative
analysis of these factors is beyond the scope of this study and will be published elsewhere.

FIG. 6 Cartoon demonstrating the two-mode search-and-fold mechanism. Top: search mode; bottom: recognition mode. (a)
Two conformations of the protein bound to DNA: partially unfolded (top) and fully folded (bottom). (b) The binding energy
landscape experienced by the protein in the corresponding conformations. (c) The spectrum of the binding energy determining
stability of the protein in the corresponding conformations.



VI. DISCUSSION 

A. Specificity "for free" : kinetics vs thermodynamics. 



The proposed mechanism of specific site location is akin to kinetic proofreading (Hopfield, 1974), which is a very
general concept for a broad class of high-specificity biochemical reactions. The required specificity is achieved in 
kinetic proofreading through formation of an intermediate metastable complex that paves the way for irreversible 
enzymatic reaction. If the reaction is much slower than the life-time of the complex, then substrates that spend 
enough time in the complex are subject to the enzymatic reaction, while substrates that form short-lived complexes 
are released back to the solvent before the reaction takes place. In other words, the substrates are selected by kinetic 
partitioning. 

In contrast to kinetic proofreading, which increases equilibrium specificity at the price of energy consumption, the
search-and-fold model does not require any additional source of energy. The two-mode search-and-fold model provides
a faster "on-rate" of binding while keeping the equilibrium binding constant unchanged. Naturally, the "off-rate" is
increased as well. This makes our two-mode model thermodynamically "neutral".

B. Coupling of folding and binding in molecular recognition



Several DNA- and ligand-binding proteins are known to have partially unfolded (disordered) structures in the 
unbound state. The unstructured regions fold upon binding to the target. Does binding-induced folding provide any 
biological advantage? 

The idea of coupling between local folding and site binding has been around for some time and was recently
reassessed in the much broader context of intrinsically unstructured proteins (Dyson and Wright, 2002; Uversky, 2002;
Wright and Dyson, 1999). Induced folding of these proteins can have several biological advantages. First,
flexible unstructured domains have intrinsic plasticity, allowing them to accommodate targets/ligands of various
size and shape. Second, free energy of binding is required to compensate for the entropic cost of ordering the
unstructured region. A poor ligand that does not provide enough binding free energy cannot induce folding and,
hence, cannot form a stable complex. Williams et al. have suggested that unstructured domains can be a result of
evolutionary selection that acts on the bound (structured) conformation while ignoring the unbound (unstructured)
conformation (Williams et al., 2001). Partial unfolding can also increase a protein's radius of gyration and, hence,
increase the binding rate (Levy et al., 2004; Shoemaker et al., 2000).



Here we propose a mechanism that suggests a role for induced folding in providing rapid and specific binding. Induced
folding (or another sort of two-state conformational transition) allows a protein to search and recognize DNA in two
different conformations, providing rapid binding to the target site. Importantly, this mechanism reconciles rapid
search for the target site with a stable bound complex (see above). The rate of induced folding can also play a role in
determining the specificity of recognition (Slutsky and Mirny, in preparation).

Structural and thermodynamic data argue in favor of distinct protein conformations for search along non-cognate
DNA and for recognition of the target site. Proteins such as $\lambda$ cI, EcoRV and GCN4 apparently do not fold their
unstructured regions while bound to non-cognate DNA (Clarke et al., 1991; O'Neil et al., 1990; Winkler et al., 1993),
supporting our hypothesis.

Heat capacity measurements on a vast variety of protein-DNA complexes report a large negative heat capacity
change in site-specific recognition, which is a clear indication of a phase transition. These measurements, supplemented
by X-ray crystallography and NMR structural data, were interpreted by Spolar and Record (Spolar and Record, 1994)
mainly in terms of hydrophobic and conformational contributions to entropy. Thus, folding-binding coupling is now
considered a well-established effect for a large set of transcription factors.

However, real-time kinetic measurements were not performed until recently, so that the question of the actual
mechanism was left open. Serious advances in this direction were made by Kalodimos et al. (Kalodimos et al.,
2004, 2002, 2001), who observed a two-step site recognition by dimeric Lac repressor. The H/D-exchange NMR data
unambiguously demonstrate site pre-selection by $\alpha$-helices bound in the major groove, followed by folding of hinge
helices that bind to the minor groove elements and complete the specific site recognition. Though the experiments
in this field were performed with a single model system, their implications are likely to have a general character.

It should be mentioned that no transition of this kind is observed when the protein is unbound from DNA. A
possible reason for this can be a significant reduction of the free energy barrier for folding, entropic in essence, that
accompanies protein-DNA association. Entropy barrier reduction is a natural consequence of relative anchoring of
the various parts of the protein on the DNA "scaffold". Thermal fluctuations that the associated protein is subject
to are generally of the order of $\sim k_BT$, and their main effect is protein translocation along the DNA. From the above
analysis, it follows that the translocation actually takes place only if the protein encounters barriers of $\sigma_s \sim k_BT$ on its
way. In a large enough collection of sites ($M \gg 1$), however, potential "wells" of depth $\sim \sigma_s\sqrt{2\ln M}$ will be present.
If the "well" depth is larger than the folding barrier height, the probability of on-site ("in-well") folding increases,
leading eventually to a stable complex formation. More detailed computational analysis of coupling between folding
and binding will be published elsewhere.

C. Biological implications 

The mechanism of 3D/1D search described above has several biological implications. Needless to say, the studied
model (as any quantitative model) is a gross simplification of protein-DNA recognition in vivo. In spite of this
simplification, the proposed mechanism can be generalized to describe in vivo binding. Here we briefly discuss some
biological implications of our model.

1. Simultaneous search by several proteins 

If several TFs are searching for their site on the DNA, the total search time is given by Eq. (15) and is obviously
shorter than the time for a single TF. For example, if 100 copies of a TF are searching in parallel for the cognate
site, then assuming $k_{\text{cytoplasm}} \approx 10^{8}\ \mathrm{M^{-1}s^{-1}}$ and a cell of $1\ \mu\mathrm{m}^3$ volume, we obtain the search time of $t_s \approx 0.1$ sec.
Increasing the number of TF molecules can further decrease the search time, but can have harmful effects due to
molecular crowding in the cell. Note, however, that increasing the number of TF molecules to 100-1000 per cell
cannot resolve the speed-stability paradox (see Fig. 5).
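
The order of magnitude of this estimate can be checked with a few lines of arithmetic. The sketch below assumes the simple relation t = 1/(k c) for a well-mixed cell, where c is the molar concentration of the searching protein. The function name and the value k = 10^8 M^-1 s^-1 are illustrative assumptions, not fitted parameters.

```python
AVOGADRO = 6.022e23  # molecules per mole


def search_time_seconds(copies, k_on_per_molar_sec, cell_volume_um3):
    """Search time t = 1 / (k_on * c) for `copies` proteins searching in parallel.

    Assumes a well-mixed cell; c is the molar concentration of the protein.
    """
    volume_liters = cell_volume_um3 * 1e-15  # 1 um^3 = 1e-15 L
    concentration_molar = copies / (AVOGADRO * volume_liters)
    return 1.0 / (k_on_per_molar_sec * concentration_molar)


if __name__ == "__main__":
    # Assumed inputs: 100 copies, k = 1e8 M^-1 s^-1, cell volume 1 um^3.
    t = search_time_seconds(copies=100, k_on_per_molar_sec=1e8, cell_volume_um3=1.0)
    print(f"search time ~ {t:.2f} s")
```

Under these assumptions the result is of the order of 0.1 sec, and it scales inversely with the copy number.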



2. Search inside a cell: molecular crowding on DNA and chromatin 

Above we assumed that a TF is free to slide along the DNA. The in vivo picture is complicated by other proteins and
protein complexes (nucleosomes, polymerases, etc.) bound to DNA, preventing a TF from sliding freely along DNA. What
are the effects of such molecular crowding on the search time?

Our model suggests that molecular crowding on DNA can have little effect on the search time if certain conditions
are satisfied. Obviously, the cognate site must not be screened by other DNA-bound molecules/nucleosomes. DNA-bound
molecules can interfere with the search process by shortening the regions of DNA scanned on each round of 1D
diffusion. If, however, the distance between DNA-bound molecules/nucleosomes in the vicinity of the cognate site is
greater than $\bar{n}_{opt} \sim 300-500$ bp (Eq. (13) and (Kim et al., 1987)), then obstacles on the DNA do not shorten the
rounds of 1D diffusion and, hence, do not slow down the search process. Our analysis also suggests that sequestration
of part of the genomic DNA by nucleosomes can even speed up the search process.

If DNA-bound proteins are separated by more than 300-500 bp, E. coli genomic DNA can accommodate $4.6 \times
10^6$ bp / 300 bp $\approx 1.5 \times 10^4$ proteins. In other words, all 150 known and predicted E. coli TFs can be simultaneously
present in 100 copies each and search for their cognate sites without affecting each other. (In fact, they can be
present in 200 copies each, since optimal search requires 50% of proteins to be in solution at any time.) On the other
hand, a short ~50 bp linker between nucleosomes in eukaryotic chromatin can increase the search time about 10-fold.
Details of this analysis will be published elsewhere.
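
The counting argument above is simple enough to verify directly. The following sketch uses the genome size and spacing quoted in the text; the helper names are hypothetical and serve only this illustration.

```python
def max_nonoverlapping_proteins(genome_bp, spacing_bp):
    """Number of DNA-bound proteins that fit if each needs `spacing_bp` of DNA."""
    return genome_bp // spacing_bp


def fits_on_genome(n_factors, bound_copies_each, genome_bp, spacing_bp):
    """True if all bound copies can be placed at least `spacing_bp` apart."""
    capacity = max_nonoverlapping_proteins(genome_bp, spacing_bp)
    return n_factors * bound_copies_each <= capacity


if __name__ == "__main__":
    capacity = max_nonoverlapping_proteins(4_600_000, 300)
    print(capacity)  # about 1.5e4
    # 150 TFs with 100 DNA-bound copies each (200 copies in total if half are in solution)
    print(fits_on_genome(150, 100, 4_600_000, 300))
```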
3. "Funnels", local organization of sites



Several known bacterial and eukaryotic sites tend to cluster together. One may suggest that such clustering or
other local arrangement of sites can create a "funnel" in the binding energy landscape, leading to more rapid binding
of cognate sites. Our model suggests that even if such "funnels" do exist, they would not significantly speed up the
search process. The proposed search mechanism involves $M/\bar{n}_{opt} \sim 10^4$ rounds of 1D/3D diffusion, so a TF spends
almost all of the search time far from the cognate site. Only the last round (out of $10^4$) will be sped up by the "funnel",
leading to no significant decrease of the search time.

Local organization of sites and other sequence-dependent properties of the DNA structure (flexibility of AT-rich
regions, DNA curvature on poly-A tracts, etc.) may influence preferred localization of TFs and lead to faster on-/off-
binding rates and fast equilibration on neighboring sites (see (Slutsky et al., 2004) for details).



4. Protein hopping: intersegment transfer 

Our model assumed that rounds of 1D diffusion are separated by periods of 3D diffusion. Intersegment transfer is
another mechanism that can be involved. If two segments of DNA come close to each other, a TF sliding along one
segment can "hop" to another. The benefit of this mechanism is that it significantly shortens the transfer time $\tau_{3d}$.
Several lines of experimental evidence suggest that tetrameric LacI, which has two DNA-binding sites, travels along DNA
through 1D diffusion and intersegment transfer.

We did not consider this mechanism because of the two following considerations. First, it is unclear whether TFs 
that have only one binding site can perform intersegment transfer. Second, for this mechanism to work, distant 
segments of DNA need to come close to each other. While DNA packed into a cell/nuclear volume crosses itself every 
~ 500bp, DNA in solution (at in vitro concentrations) is unlikely to have any such self-crossings. Hence intersegment 
transfer cannot explain "faster than diffusion" binding rates observed in vitro. This mechanism however may play a 
role in vivo, especially for proteins that have multiple DNA-binding sites. 



D. Proposed experiments 

Our results yield several experimentally testable predictions.

First, we predict that the maximal rate of binding is achieved when the protein spends half of the time in solution
and half sliding along the DNA. This result can be readily verified experimentally by measuring the concentration of free
protein in a solution that contains DNA but no cognate site. We also show how the search time depends on the energy
of non-specific binding, which, in turn, can be controlled by the ionic strength of the solution or by engineering proteins with
stronger or weaker non-specific binding. In vivo observation of the "50/50" rule would suggest that proteins are
optimized by evolution for rapid search.

Second, we show how the binding rate depends on the average travel time between two random segments of DNA, $\tau_{3d}$.
The time $\tau_{3d}$ depends on the DNA concentration and the domain organization of DNA. By changing the DNA concentration
and/or DNA stretching in a single-molecule experiment, one can alter $\tau_{3d}$ and thus study the role of DNA packing in
the rate of binding. This effect has implications for DNA recognition in vivo, where DNA is organized in domains.
Similarly, one can experimentally measure the binding rate in the presence of other DNA-binding proteins or
nucleosomes and compare it with analytical predictions.

Single-molecule experiments and AFM/SFM imaging allow direct observation of the protein trajectory and measurement
of the 1D diffusion coefficient $D_{1d}$ on non-cognate DNA. Our formalism, in turn, allows one to calculate the
spectrum of specific binding energy given $D_{1d}$. Such measurements can be direct tests of our conjecture that 1D
search along non-cognate DNA proceeds along a "smoother" energy profile.

Third, using protein engineering one can stabilize unstructured regions of DNA-binding proteins (e.g. $\lambda$ cI, EcoRV
and GCN4) and study binding rates of these engineered rigid proteins. Such experiments can test the proposed
search-and-fold mechanism and shed light on the role of unstructured regions in determining stability, specificity and
binding rates.

We also suggest that proteins bound to non-cognate DNA are not fully ordered. Unfortunately, very few studies
(Kalodimos et al., 2004, 2002, 2001) have addressed the mechanisms of binding to non-cognate DNA. More studies
of the structures, thermodynamics and dynamics of proteins bound to non-cognate DNA will deepen our understanding
of specific protein-DNA recognition.



VII. CONCLUSIONS 

We have developed a quantitative model of protein-DNA interaction that provides insight into the mechanism
of fast target site location. We found the range of parameters (specific and non-specific binding energies) that are
crucial for fast search and, hence, robust functioning of gene transcription. Paradoxically, realistic energy functions cannot
provide both rapid search and strong binding of a rigid protein. This allowed us to formulate the speed-stability paradox
of protein-DNA recognition (which is similar to the famous Levinthal paradox of protein folding). To resolve this paradox,
we proposed a search-and-fold mechanism that involves the coupling of protein binding and protein folding.

The proposed mechanism has several important biological implications, explaining how a protein can find its site on
DNA in vivo in the presence of other proteins and nucleosomes, and by simultaneous search of several proteins.
Our model provides, for the first time, a quantitative framework for analysis of the kinetics of transcription factor binding
and, hence, gene expression. Importantly, our model links molecular properties of transcription factors to the timing
of transcription activation. Proper understanding of the entire mechanism will hardly be possible without further
experimental effort in these directions.

Appendix A: Diffusive properties of the DNA.

The derivation consists of two steps. First, we describe the random walk along the DNA in terms of the number of
steps. Next, we calculate the mean time between successive steps in a random energy landscape, which provides
the time scale for the problem. Such a decoupling, strictly speaking, does not hold when the number of steps is
small, i.e. when the number of visited sites is small and the random quantities are not averaged properly. However,
since we are dealing with large numbers of steps (~ 10^ - 10^) this approach is legitimate, which is also confirmed by
numerical simulations.



The MFPT. 



To derive the diffusion law, we calculate the mean first passage time (MFPT) from site #0 to site #L, defined as
the mean number of steps the particle has to make in order to reach site #L for the first time. The derivation
here follows the one in (Murthy and Kehr).

Let $P_{i,j}(n)$ denote the probability to start at site #i and reach site #j in exactly n steps. Then, for example,

$$ P_{i,i+1}(n) = p_i\,T_i(n-1), \tag{21} $$

where $p_i$ and $q_i = 1 - p_i$ are the probabilities of stepping from site #i to the right and to the left, respectively, and
$T_i(n)$ is defined as the probability of returning to the i-th site after n steps without stepping to the right of
it. Now, all the paths contributing to $T_i(n-1)$ should start with a step to the left and then reach site #i in
n - 2 steps, not necessarily for the first time. Thus, the probability $T_i(n-1)$ can be written as

$$ T_i(n-1) = q_i \sum_{m,l} P_{i-1,i}(m)\,T_i(l)\,\delta_{m+l,\,n-2}. \tag{22} $$

We now introduce generating functions

$$ P_{i,j}(z) = \sum_{n=0}^{\infty} z^n P_{i,j}(n), \qquad T_i(z) = \sum_{n=0}^{\infty} z^n T_i(n). \tag{23} $$

One can easily show (see e.g. (Goldhirsch and Gefen, 1986)) that

$$ P_{0,L}(z) = \prod_{i=0}^{L-1} P_{i,i+1}(z). \tag{24} $$

Knowing $P_{i,i+1}(z)$, one calculates the MFPT straightforwardly as

$$ \bar{t}_{0,L} = \frac{\sum_n n\,P_{0,L}(n)}{\sum_n P_{0,L}(n)} = \left.\frac{\partial}{\partial z}\ln P_{0,L}(z)\right|_{z=1} = \sum_{i=0}^{L-1}\left.\frac{\partial}{\partial z}\ln P_{i,i+1}(z)\right|_{z=1}. \tag{25} $$

Using (21) and (22), we obtain the following recursion relation for $P_{i,i+1}(z)$:

$$ P_{i,i+1}(z) = \frac{z\,p_i}{1 - z\,q_i\,P_{i-1,i}(z)}. \tag{26} $$

To solve for $\bar{t}_{0,L}$ we must introduce boundary conditions. Let $p_0 = 1$, $q_0 = 0$, which is equivalent to introducing
a reflecting wall at i = 0. This boundary condition clearly influences the solution for short times and distances.
However, as numerical simulations and general considerations suggest, its influence relaxes quite fast, so that for
longer times the result is clearly independent of the boundary. The benefit of setting $p_0 = 1$ becomes clear when we
observe that

$$ P_{0,1}(1) = 1 \;\Rightarrow\; \forall i:\; P_{i,i+1}(1) = 1. \tag{27} $$

Hence,

$$ \bar{t}_{0,L} = \sum_{i=0}^{L-1} P'_{i,i+1}(1). \tag{28} $$

The recursion relation for $P'_{i,i+1}(1)$ is readily obtained from (26):

$$ P'_{i,i+1}(1) = \frac{1}{p_i} + \frac{q_i}{p_i}\,P'_{i-1,i}(1) = 1 + \alpha_i\left[1 + P'_{i-1,i}(1)\right], \tag{29} $$

with $\alpha_i = q_i/p_i$. Thus, the expression for $\bar{t}_{0,L}$ is obtained in closed form

$$ \bar{t}_{0,L} = L + \sum_{k=0}^{L-1}\alpha_k + \sum_{k=0}^{L-2}(1+\alpha_k)\sum_{i=k+1}^{L-1}\;\prod_{j=k+1}^{i}\alpha_j. \tag{30} $$

This expression gives the MFPT in terms of a given realization of disorder producing a certain set of probabilities
$\{p_i\}$, whereas we are interested in the behavior averaged over all realizations of disorder. The cumulative products
in (30) reduce to the form $e^{\beta(U_i - U_j)}$, which after being averaged over uncorrelated Gaussian disorder produce a
factor of $e^{\beta^2\sigma^2}$. After the summations are carried out, the expression for the MFPT becomes, for $L \gg 1$,

$$ \langle \bar{t}_{0,L} \rangle \simeq L^2 e^{\beta^2\sigma^2}. \tag{31} $$

Thus, the diffusion law appears to be the classical one, with a renormalized diffusion coefficient. 
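
As a consistency check on the step-counting part of the derivation, the sketch below evaluates the MFPT for a given realization of disorder in two ways: by iterating the recursion $P'_{i,i+1}(1) = 1 + \alpha_i[1 + P'_{i-1,i}(1)]$ with $\alpha_i = q_i/p_i$, and by the closed-form triple sum. The function names are hypothetical, and the mapping from energies to $\alpha_i$ assumes hopping rates proportional to $e^{-\beta(U_{i\pm1}-U_i)}$; it is an illustration, not the simulation code used for the paper.

```python
import math
import random


def mfpt_recursion(alpha):
    """Mean number of steps from site 0 to site L = len(alpha).

    alpha[i] = q_i / p_i; alpha[0] must be 0 (reflecting wall at site 0).
    Iterates P'_i = 1 + alpha_i * (1 + P'_{i-1}) and sums P'_i over i.
    """
    total = 0.0
    prev = 0.0  # irrelevant for i = 0 because alpha[0] = 0
    for a in alpha:
        cur = 1.0 + a * (1.0 + prev)
        total += cur
        prev = cur
    return total


def mfpt_closed_form(alpha):
    """Same quantity from the closed-form sum (O(L^2) operations)."""
    L = len(alpha)
    total = L + sum(alpha)
    for k in range(L - 1):
        running_product = 1.0
        inner = 0.0
        for i in range(k + 1, L):
            running_product *= alpha[i]  # product of alpha_j for j = k+1..i
            inner += running_product
        total += (1.0 + alpha[k]) * inner
    return total


def alphas_from_energies(energies, beta=1.0):
    """alpha_i = exp(beta * (U_{i+1} - U_{i-1})), assuming rates ~ exp(-beta * (U_target - U_i)).

    `energies` holds U_0..U_L (L + 1 values); alpha_0 is set to 0.
    """
    L = len(energies) - 1
    alpha = [0.0]
    for i in range(1, L):
        alpha.append(math.exp(beta * (energies[i + 1] - energies[i - 1])))
    return alpha


if __name__ == "__main__":
    flat = alphas_from_energies([0.0] * 11)  # flat landscape, L = 10
    print(mfpt_recursion(flat))

    rng = random.Random(0)
    rough = alphas_from_energies([rng.gauss(0.0, 0.5) for _ in range(51)])
    print(mfpt_recursion(rough), mfpt_closed_form(rough))
```

For a flat landscape the recursion gives exactly L^2 steps, the classical diffusion law; for a rough landscape the two evaluations agree to rounding error.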

The time constant. 

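Consider a particle at site #i. The particle will eventually escape to one of the neighboring sites #(i ± 1), the
escape rate being

$$ r_i = \omega_{i,i+1} + \omega_{i,i-1}. \tag{32} $$

To calculate the characteristic diffusion time constant $\langle\tau\rangle$, this rate should be averaged over all configurations of
disorder $\{U_i\}$. To obtain an analytic expression for $\langle\tau\rangle$, we assume the form

$$ \omega_{i,i\pm1} = \nu\,e^{-\beta(U_{i\pm1} - U_i)} \tag{33} $$

for both $U_{i\pm1} > U_i$ and $U_{i\pm1} < U_i$, as opposed to the form introduced earlier. Numerics show that this approximation introduces
an up to ~15% error for small values of $\beta\sigma$ and is practically exact for $\beta\sigma > 2$. Thus,

$$ r_i = \frac{1}{2\tau_0}\left(e^{-\beta(U_{i+1} - U_i)} + e^{-\beta(U_{i-1} - U_i)}\right), \tag{34} $$

where $\tau_0 \sim 1/(2\nu)$. The mean time between successive steps can therefore be calculated as the average over all
possible configurations of $U_i$, $U_{i\pm1}$ of the reciprocal of the escape rate, i.e.

$$ \langle\tau\rangle = \left\langle \frac{1}{r_i} \right\rangle = \int dU_{i-1}\,dU_i\,dU_{i+1}\;\frac{f(U_{i-1})\,f(U_i)\,f(U_{i+1})}{r_i}, \tag{35} $$

where f(U) is the distribution of site energies. Assuming, as above, Gaussian energy statistics, this integral is evaluated as follows:

$$ \frac{\langle\tau\rangle}{\tau_0} = \frac{e^{\beta^2\sigma^2/2}}{\pi}\int\!\!\int_{-\infty}^{\infty} dx\,dy\;\frac{e^{-(x^2+y^2)/2}}{e^{-\beta\sigma x} + e^{-\beta\sigma y}}, \tag{36} $$

with $x = U_{i+1}/\sigma$ and $y = U_{i-1}/\sigma$. After the change of variables

$$ s = \frac{1}{\sqrt{2}}(x + y), \qquad t = \frac{1}{\sqrt{2}}(x - y), \tag{37} $$

the integral factorizes, leading to

$$ \frac{\langle\tau\rangle}{\tau_0} = e^{3\beta^2\sigma^2/4}\int_{-\infty}^{\infty}\frac{dt}{\sqrt{2\pi}}\,e^{-t^2/2 - \ln[\cosh(\beta\sigma t/\sqrt{2})]} \simeq e^{3\beta^2\sigma^2/4}\int_{-\infty}^{\infty}\frac{dt}{\sqrt{2\pi}}\,e^{-t^2(1+\beta^2\sigma^2/2)/2} = e^{3\beta^2\sigma^2/4}\left[1 + \beta^2\sigma^2/2\right]^{-1/2}. \tag{38} $$

Now, multiplying the MFPT (31) by $\langle\tau\rangle$, we obtain the diffusion coefficient as

$$ D_{1d} \simeq \frac{1}{2\tau_0}\left(1 + \frac{\beta^2\sigma^2}{2}\right)^{1/2} e^{-7\beta^2\sigma^2/4}. \tag{39} $$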

Appendix B: Non-specific energy 



To find how the non-specific energy $E_{ns}$ is related to the average time $\tau_{1d}$ a protein spends scanning a single region
of the DNA, we use the simple observation that

$$ \left\langle \sum_i \tau_i r_i \right\rangle = 1, \qquad \left\langle \sum_i \tau_i \right\rangle = \tau_{1d}, \tag{40} $$

which states that "on average" the protein dissociates once from the region it scans.

Since extensive hopping from site to site takes place before the particle eventually dissociates, the dissociation
rates and, consequently, the non-specific binding energy should satisfy the following equation

$$ \left\langle \sum_i \tau_i r_i \right\rangle = \frac{1}{\tau_0}\left\langle \sum_i \tau_i\,e^{-\beta(E_{ns} - U_i)} \right\rangle = \frac{n}{\tau_0}\int e^{-\beta(E_{ns} - U)}\,\tau(U)\,f(U)\,dU = 1, \tag{41} $$

subject to the condition

$$ \left\langle \sum_i \tau_i \right\rangle = n\int \tau(U)\,f(U)\,dU = \tau_{1d}, \tag{42} $$

where $\tau_i$ is the time the TF spends at the i-th site, n is the number of sites in the scanned region, and $\tau_{1d}$ is the average
time of 1D search until dissociation. The average
lifetime $\tau_i = \tau(U_i)$ at that site is proportional to $\exp(-\beta U_i)$. In this specific case, the particle usually escapes to
one of the neighboring sites, and we should average over their energies. Hence, the explicit form of $\tau(U)$, as calculated
from (38), is

$$ \tau(U) = C\,\tau_0\,e^{-\beta U}, \tag{43} $$

where the prefactor C depends only on $\beta\sigma$ and cancels in what follows. Substituting this into (41) and (42), we have

$$ \frac{\tau_{1d}}{\tau_0}\,e^{-\beta^2\sigma^2/2 - \beta E_{ns}} = 1, \tag{44} $$

so that

$$ E_{ns} \simeq k_BT\left[\ln\frac{\tau_{1d}}{\tau_0} - \frac{1}{2}\left(\frac{\sigma}{k_BT}\right)^2\right]. \tag{45} $$

Next we recall that, in the optimal regime, $\tau_{1d} = \tau_{3d}$. Thus, to ensure optimal performance, $E_{ns}$ should be equal to
the expression in (45) with $\tau_{1d}$ replaced by $\tau_{3d}$:

$$ E_{ns} \simeq k_BT\left[\ln\frac{\tau_{3d}}{\tau_0} - \frac{1}{2}\left(\frac{\sigma}{k_BT}\right)^2\right]. \tag{46} $$

The meaning of this relation is quite transparent. The logarithm gives $E_{ns}$ in a system with zero or constant specific
binding energy. The second term introduces suppression of $E_{ns}$ due to disorder, so that the dissociation events
in a system with disorder are more frequent to compensate partially for the 1D diffusion slowdown. This relation
obviously holds as long as $E_{ns} > 0$. Negative values of $E_{ns}$ mean simply that the non-specific interaction has become
overshadowed by the specific one and has no direct physical sense anymore.
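
The relation between the optimal non-specific energy and the roughness can be tabulated with a short script. This is a minimal sketch of $E_{ns}/k_BT = \ln(\tau_{3d}/\tau_0) - (\sigma/k_BT)^2/2$; the numerical values of $\tau_{3d}$ and $\tau_0$ used below are hypothetical placeholders, and the function name is not from the paper.

```python
import math


def optimal_nonspecific_energy_kT(tau_3d, tau_0, sigma_kT):
    """E_ns / k_B T = ln(tau_3d / tau_0) - 0.5 * (sigma / k_B T)^2."""
    return math.log(tau_3d / tau_0) - 0.5 * sigma_kT ** 2


if __name__ == "__main__":
    tau_3d = 1e-3  # sec, hypothetical value for illustration
    tau_0 = 1e-8   # sec, hypothetical value for illustration
    for sigma_kT in (0.0, 1.0, 2.0, 3.0):
        e_ns = optimal_nonspecific_energy_kT(tau_3d, tau_0, sigma_kT)
        print(f"sigma = {sigma_kT:.1f} kT -> E_ns = {e_ns:.2f} kT")
```

The disorder term grows quadratically with roughness, so the required non-specific energy eventually reaches zero, which marks the limit of validity noted above.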

Since, for a given value of $\sigma$, the non-specific binding controls the dissociation rate, the search time will deviate
from the optimum if $E_{ns}$ moves from this predetermined value. In Fig. 3a we plot the search time as a function of
the non-specific binding energy for different values of $\sigma$.

We now define the tolerance factor $\zeta$ as the ratio between the maximal acceptable value of the search time $t_s$ and
the minimal time $t_s^0$. Experimental data suggest $\zeta \sim 5$, but for the moment we allow for much larger values of
$\zeta \sim 10-100$ (this can be done when, for instance, there are many protein molecules searching in parallel). As we
can see from Fig. 3a, for each value of $\sigma$, there is a range of possible values of $E_{ns}$ such that the resulting search time
is within the region of tolerance. This range is easily calculated, producing the values of non-specific energy between

$$ E_{ns}^{\pm} = k_BT\left\{\ln\left[\frac{D_{1d}(\sigma)\,\tau_{3d}}{D_{1d}(0)\,\tau_0}\left(\zeta \pm \sqrt{\zeta^2 - \frac{D_{1d}(0)}{D_{1d}(\sigma)}}\right)^{2}\right] - \frac{\sigma^2\beta^2}{2}\right\}. \tag{47} $$

Appendix C: Role of DNA conformation 

The central parameter here is $\tau_{3d}$, the interval of time between the dissociation of the protein from DNA and the
next binding to DNA. Exact calculation of $\tau_{3d}$ is a very difficult task, considering the nontrivial packaging of the
DNA molecule inside a bacterial cell, electrostatic effects and the inhomogeneity of the cytoplasm. Considering the
microscopic picture, one can easily obtain a reasonable estimate of $\tau_{3d}$ as a characteristic time of 3D diffusion across
the nucleoid (the region of a bacterial cell to which the DNA is confined). The corresponding diffusion length depends
on the conformation of the DNA molecule. Indeed, if the DNA molecule were a single homogeneous globule, there
would be a single relevant length scale, which is the characteristic size of the molecule $l_m$ (the gyration radius). On the other
hand, as Fig. 7 shows, diffusion of a protein molecule inside a more realistic non-homogeneous multi-domain molecule
involves at least one additional length scale $l_d$, which is a characteristic size of a domain. These two lengths may
differ by a factor of ~10 (Neidhardt et al., 1993), making the ratio of the resulting diffusion times $t_m/t_d \sim 10^2$.
In the original problem (a single protein molecule searching for a single site on the DNA), the search process is
dominated by the larger time scale, since at least a few domains must be explored before the target site is located.
However, there are usually about 10^ TF molecules present in a cell, so it is reasonable to assume that the domains
are scanned in parallel, making the inter-domain transfer processes irrelevant.

Appendix D: Stability requirement

In fact, it is not hard to estimate analytically the ($\sigma/k_BT$) ratio for a genome of length M such that the probability
of binding to the lowest site is comparable to the probability of binding to the rest of the genome, i.e. their
contributions to the partition function are of the same order of magnitude. The partition sum for the Gaussian
energy level statistics is

$$ \mathcal{Z} = \frac{M}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\infty} e^{-\beta U - U^2/2\sigma^2}\,dU = M e^{\beta^2\sigma^2/2} \sim \exp\left[-\beta U_{min}\right] \simeq \exp\left(\beta\sigma\sqrt{2\ln M}\right), \tag{48} $$

so that for $M = 10^6$

$$ \sigma \sim k_BT\sqrt{2\ln M} \sim 5\ k_BT. \tag{49} $$

Strictly speaking, for a large though finite set of energy levels, the integration limits are cut off at $\pm\sigma\sqrt{2\ln M}$, so that
for $\beta\sigma \gtrsim \sqrt{\ln M}$ the partition function is dominated by the lower edge of the distribution. The estimate for $\beta\sigma$ gives
therefore the crossover value between the regime of multiple-site contribution to $\mathcal{Z}$ and the regime with single-site
domination$^{1}$.
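
The crossover roughness follows from equating $M e^{\beta^2\sigma^2/2}$ with $e^{\beta\sigma\sqrt{2\ln M}}$, which gives $\beta\sigma = \sqrt{2\ln M}$. The sketch below evaluates it for a few genome sizes; the function name and the chosen values of M are illustrative assumptions.

```python
import math


def crossover_roughness_kT(genome_sites):
    """sigma / k_B T at which the lowest site matches the rest of the genome: sqrt(2 ln M)."""
    return math.sqrt(2.0 * math.log(genome_sites))


if __name__ == "__main__":
    for m in (1e5, 1e6, 1e7):
        print(f"M = {m:.0e}: sigma ~ {crossover_roughness_kT(m):.2f} kT")
```

The dependence on M is only logarithmic, which is why the estimate of about 5 $k_BT$ is insensitive to the exact genome length.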

If $N_p$ proteins are searching for and binding a single target site, then the probability of the site being occupied is given by

$$ P(N_p) = 1 - (1 - P_b)^{N_p} \approx N_p P_b, \tag{50} $$

where $P_b$ is the probability of the site being occupied by a single protein (eq 19 of the paper) and the approximation holds
for $P_b \ll 1/N_p$. As evident from Fig 4b, the requirement of rapid search is satisfied if $P_b(\sigma/T \approx 1)$ ~ 10^^. An
unfeasible amount of 10'' copies of a single TF would be required to saturate such a weak binding site.
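
The occupancy formula and its linear approximation can be compared numerically. This is a small sketch with a hypothetical single-protein occupancy; the values are placeholders chosen only to show the two regimes.

```python
def occupancy(n_proteins, p_single):
    """Probability that the site is occupied by at least one of n independent proteins."""
    return 1.0 - (1.0 - p_single) ** n_proteins


if __name__ == "__main__":
    p_single = 1e-6  # hypothetical single-protein occupancy, for illustration only
    for n in (1, 100, 10_000, 1_000_000, 10_000_000):
        exact = occupancy(n, p_single)
        linear = n * p_single
        print(f"N_p = {n:>10}: exact = {exact:.3e}, N_p * P_b = {linear:.3e}")
```

The linear form is accurate while $N_p P_b \ll 1$; saturation requires a copy number of the order of $1/P_b$.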



$^{1}$ In the Random Energy Model (Derrida, 1981), the analog of this effect would be the thermodynamic freezing.

Appendix E: Energy Gap

A large energy gap between the cognate site $s_c$ and the bulk of genomic sites would solve the paradox of rapid search
and stability. One may seek parameters $\epsilon(j, s)$ of the energy function

$$ U(s = s_1 \ldots s_l) = \sum_{j=1}^{l} \epsilon(j, s_j), \tag{51} $$

to maximize the energy gap by minimizing the Z-score

$$ Z(s_c) = \frac{U(s_c) - \langle U \rangle}{\sigma}, \tag{52} $$

where the averaging and variance are taken over all possible sequences of length l (or over genomic words of length l). It is
easy to see that $Z(s_c)$ is minimal if

$$ \epsilon^{opt}(j, s) = -\delta(s, s_{c,j}), \tag{53} $$

where $\delta(x, y)$ is the Kronecker delta. For K types of nucleotides, assuming their equal frequency in the genome, we obtain
the maximal reachable energy gap of

$$ Z^{min} = -\sqrt{l(K-1)}. \tag{54} $$

For K = 4 and $l \approx 8$ we get $Z^{min} \approx -5$. For the genome of 10^ - lO^bp the energy spectrum of the genomic DNA
ends at $Z \approx -5$. While sufficient to provide stability of the bound complex (see main text), such an energy gap is
unable to resolve the search-stability paradox.
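
The value of the minimal Z-score can be cross-checked by brute-force enumeration for short words. The sketch below assumes the energy function $\epsilon(j, s) = -\delta(s, s_{c,j})$ and equal nucleotide frequencies; the function names are hypothetical.

```python
import itertools
import math


def min_z_score(site_length, alphabet_size):
    """Analytic value Z_min = -sqrt(l * (K - 1))."""
    return -math.sqrt(site_length * (alphabet_size - 1))


def z_score_by_enumeration(site_length, alphabet_size):
    """Z-score of the cognate word under epsilon(j, s) = -delta(s, s_c[j]).

    Enumerates all alphabet_size**site_length words with equal weight;
    the cognate word is taken to be (0, 0, ..., 0) without loss of generality.
    """
    energies = [
        -sum(1 for letter in word if letter == 0)
        for word in itertools.product(range(alphabet_size), repeat=site_length)
    ]
    mean = sum(energies) / len(energies)
    variance = sum((e - mean) ** 2 for e in energies) / len(energies)
    cognate_energy = -site_length
    return (cognate_energy - mean) / math.sqrt(variance)


if __name__ == "__main__":
    print(min_z_score(8, 4), z_score_by_enumeration(8, 4))
```

Enumeration is feasible only for short sites ($4^8$ words here), but it confirms the scaling with l and K.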



Appendix F: Diffusion in water and in cytoplasm 



The diffusion coefficient of a protein molecule in water can be estimated as (Landau and Lifshitz, 1987)

$$ D = \frac{k_BT}{3\pi\eta d}, \tag{55} $$

where d is the diameter of the molecule and $\eta$ is the water viscosity. Setting $\eta \sim 10^{-2}$ g/(sec · cm) and $d \approx 10$ nm,
we obtain at room temperature

$$ D \sim 10^{2}\ \mu\mathrm{m}^2/\mathrm{sec}. \tag{56} $$

Diffusion coefficient measurements for GFP in E. coli (Elowitz et al., 1999) produce values of about 1-10 $\mu\mathrm{m}^2$/sec.

This difference in diffusion coefficients may account for the more than an order of magnitude difference between the theoretically
calculated and measured target location times.
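
The water estimate can be reproduced from the Stokes-Einstein relation $D = k_BT/(3\pi\eta d)$. This sketch works in SI units ($\eta = 10^{-3}$ Pa·s is the same as $10^{-2}$ g/(sec · cm)); the function name and the choice of 298 K for room temperature are assumptions of this illustration.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K


def stokes_einstein_um2_per_s(diameter_m, viscosity_pa_s, temperature_k=298.0):
    """Diffusion coefficient D = k_B T / (3 pi eta d), returned in um^2/sec."""
    d_m2_per_s = K_B * temperature_k / (3.0 * math.pi * viscosity_pa_s * diameter_m)
    return d_m2_per_s * 1e12  # 1 m^2 = 1e12 um^2


if __name__ == "__main__":
    d_water = stokes_einstein_um2_per_s(diameter_m=10e-9, viscosity_pa_s=1e-3)
    print(f"D in water ~ {d_water:.0f} um^2/sec")
```

The result is several tens of $\mu\mathrm{m}^2$/sec, roughly an order of magnitude or more above the values measured in the cytoplasm.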